
(ICML 2016) Meta-learning with memory-augmented neural networks

Keyword [MANN (Memory-Augmented Neural Network)] [Memory] [NTM (Neural Turing Machines)]

Santoro, Adam, Bartunov, Sergey, Botvinick, Matthew, Wierstra, Daan, and Lillicrap, Timothy. Meta-learning with memory-augmented neural networks. In Proceedings of The 33rd International Conference on Machine Learning, pp. 1842–1850, 2016.



1. Overview


1.1. Motivation

  • When new data is encountered, conventional models must inefficiently relearn their parameters to adequately incorporate the new information without catastrophic interference

This paper proposes using a Memory-Augmented Neural Network (MANN), such as the NTM, which can:

  • quickly encode and retrieve new information
  • rapidly assimilate new data and leverage it to make accurate predictions after only a few samples, without re-training (one-shot learning)

It also introduces a new method, Least Recently Used Access (LRUA), for accessing an external memory. The resulting model is able to:

  1. slowly learn an abstract method for obtaining useful representations of raw data, via gradient descent
  2. rapidly bind never-before-seen information after a single presentation, via an external memory module

1.2. Meta-Learning Task Methodology



  • inputs are presented with time-offset labels (the label for the image at time t-1 arrives together with the image at time t), and the class-label bindings are shuffled from dataset to dataset (episode to episode)



  • the model must learn to hold data samples in memory until the appropriate labels are presented at the next time step, after which sample-class information can be bound and stored for later use

1.2.1. Input

[batch size, length of an episode, h*w*c + class_nb]

  • for example, [16, 50, 20*20*1+5] for the Omniglot dataset in the paper
  • the class_nb part at time t is the one-hot label of the input image at time t-1

1.2.2. Output

[batch size, length of an episode, class_nb of an episode]

  • for example, [16, 50, 5]: only 5 classes appear in an episode of length 50 (50 input images)

For a given episode, ideal performance involves a random guess for the first presentation of a class, and use of memory to achieve perfect accuracy thereafter.
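
To make the input/output layout concrete, here is a minimal sketch (not the paper's code) of how an episode with time-offset labels could be built; the names make_episode and images_by_class are illustrative, and images are assumed to be pre-flattened 20x20 arrays keyed by class.

```python
import numpy as np

def make_episode(images_by_class, n_classes=5, episode_len=50, batch_size=16):
    """images_by_class: dict mapping class id -> array of flattened 20*20 images."""
    x = np.zeros((batch_size, episode_len, 20 * 20 * 1 + n_classes), dtype=np.float32)
    y = np.zeros((batch_size, episode_len, n_classes), dtype=np.float32)
    for b in range(batch_size):
        # sample the episode's classes; the class -> label binding is reshuffled per episode
        classes = np.random.choice(list(images_by_class), n_classes, replace=False)
        prev_label = np.zeros(n_classes, dtype=np.float32)    # no label at t = 0
        for t in range(episode_len):
            c = np.random.randint(n_classes)                  # episode-local label
            samples = images_by_class[classes[c]]
            x[b, t, :400] = samples[np.random.randint(len(samples))]  # image at time t
            x[b, t, 400:] = prev_label                        # label of the image at t-1
            y[b, t, c] = 1.0                                  # target for time t
            prev_label = y[b, t]
    return x, y   # x: (16, 50, 20*20*1+5), y: (16, 50, 5)
```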

1.3. Neural Turing Machines

Memory encoding and retrieval in a NTM external memory module is rapid.



1.3.1. Memory Read



  • r. the read vector retrieved from memory
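
Written out in the paper's notation, the controller emits a key k_t that is compared to each memory row M_t(i) by cosine similarity; the read weights are a softmax over these similarities, and the read vector is their weighted sum:

```latex
K\!\left(\mathbf{k}_t, \mathbf{M}_t(i)\right)
  = \frac{\mathbf{k}_t \cdot \mathbf{M}_t(i)}
         {\lVert \mathbf{k}_t \rVert \, \lVert \mathbf{M}_t(i) \rVert},
\qquad
w_t^r(i) = \frac{\exp\!\big(K(\mathbf{k}_t, \mathbf{M}_t(i))\big)}
                {\sum_j \exp\!\big(K(\mathbf{k}_t, \mathbf{M}_t(j))\big)},
\qquad
\mathbf{r}_t = \sum_i w_t^r(i)\, \mathbf{M}_t(i)
```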

1.4. Least Recently Used Access

  • a pure content-based memory writer that writes memories to either the least used memory location or the most recently used memory location

1.4.1. Usage Weights



  • wu. usage weights
  • wr. read weights
  • ww. write weights
  • γ. decay parameter, 0.95 in this paper
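
With these symbols, the usage-weight update from the paper is:

```latex
\mathbf{w}_t^u \leftarrow \gamma\, \mathbf{w}_{t-1}^u + \mathbf{w}_t^r + \mathbf{w}_t^w
```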

1.4.2. Least-Used Weights



  • wlu. least-used weights
  • m(v, n). nth smallest element of the vector v
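
The least-used weights are computed elementwise from the usage weights (with n set to the number of reads to memory):

```latex
w_t^{lu}(i) =
\begin{cases}
0, & \text{if } w_t^u(i) > m(\mathbf{w}_t^u, n) \\
1, & \text{if } w_t^u(i) \le m(\mathbf{w}_t^u, n)
\end{cases}
```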

1.4.3. Write Weights



  • σ. sigmoid function
  • α. learnable gate parameter
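
The write weights interpolate, via the learnable gate, between the previous read weights and the previous least-used weights:

```latex
\mathbf{w}_t^w \leftarrow \sigma(\alpha)\, \mathbf{w}_{t-1}^r + \big(1 - \sigma(\alpha)\big)\, \mathbf{w}_{t-1}^{lu}
```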

1.4.4. Memory Write



  • prior to writing to memory, the least used memory location is set to zero
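
Each memory row is then updated with the write weights and the key emitted by the controller:

```latex
\mathbf{M}_t(i) \leftarrow \mathbf{M}_{t-1}(i) + w_t^w(i)\, \mathbf{k}_t
```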

1.5. Access Module



Code

Input. (16, 50, 20*20)
(a) reshape to (50, 16, 20*20)
(b) at each time step, a (16, 20*20) slice enters the Access Module, which outputs

  • M_t (16, 128, 40)
  • c_t (16, 200)
  • h_t (16, 200)
  • r_t (16, 4*40)
  • wr_t (16, 4, 128)
  • wu_t (16, 128)

(c) after repeating this for all 50 time steps, we collect in total

  • M (50, 16, 128, 40)
  • c (50, 16, 200)
  • h (50, 16, 200)
  • r (50, 16, 4*40)
  • wr (50, 16, 4, 128)
  • wu (50, 16, 128)

(d) concatenate r and h to get (50, 16, 200+160)
(e) (50, 16, 360) · (360, 5) → (50, 16, 5) → (16, 50, 5)
(f) compute the loss, backpropagate, and update the parameters
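
As a rough PyTorch-style sketch of steps (a)-(f) above (not the paper's code): AccessModule, i.e. the LSTM controller plus LRUA memory, and classifier are assumed to be defined elsewhere, and their interfaces here (initial_state, the (r_t, h_t, state) return) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

B, T, IN = 16, 50, 20 * 20          # batch, episode length, per-step input size
H, N, W, R, C = 200, 128, 40, 4, 5  # controller hidden, memory slots, slot width, read heads, classes

def forward_episode(x, y, access_module, classifier):
    """x: (B, T, IN) episode inputs, y: (B, T, C) one-hot targets."""
    x = x.transpose(0, 1)                                 # (a) -> (T, B, IN)
    state = access_module.initial_state(B)                # M_0, c_0, h_0, r_0, wr_0, wu_0
    logits = []
    for t in range(T):                                    # (b)/(c) step through the episode
        r_t, h_t, state = access_module(x[t], state)      # r_t: (B, R*W), h_t: (B, H)
        o_t = classifier(torch.cat([h_t, r_t], dim=-1))   # (d)/(e) (B, H+R*W) -> (B, C)
        logits.append(o_t)
    logits = torch.stack(logits, dim=1)                   # (B, T, C)
    loss = F.cross_entropy(logits.reshape(-1, C),         # (f) loss over all time steps
                           y.reshape(-1, C).argmax(dim=-1))
    return logits, loss
```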



2. Experiments


2.1. DataSet

2.1.1. Omniglot

  • contains over 1600 separate classes with only a few examples per class, aptly leading to it being called the "transpose" of MNIST
  • data augmentation is applied: random translations and rotations
  • new classes are created through 90°, 180° and 270° rotations of existing classes
  • 1200 classes for training, 423 classes for testing
  • images are downscaled to 20x20
  • the classes in testing set are different from classes in training set
  • In testing set, each episode contains unique classes.

2.2. Performance



  • x-axis. training episode; when a new episode starts, the memory is wiped (set to 0)
  • y-axis. testing performance
  • n-th instance. within an episode, the n-th time a sample of a given class is presented
  • five-character string labels. to shorten the one-hot label vector, each label is represented as a string of length 5, where each of the 5 positions takes one of 5 possible characters, so 5^5 = 3125 classes can be represented



  • MANN with the standard NTM access module performs worse than MANN with LRU Access

2.3. Persistent Memory Inference

  • As each episode contains unique classes, wiping the memory between episodes is important
  • Performance degrades when the memory is not wiped between episodes


2.4. Curriculum Training

  • the model was first tasked to classify fifteen classes per episode
  • every 10,000 training episodes thereafter, the maximum number of classes presented per episode was incremented by one